1 INTRODUCTION

Intracellular infections of epithelial cells are hypothesised to produce expression of gene pathways related to immune response, infection response, and inflammation. To test this hypothesis, Salmonella Typhimurium and the Hela-229 cell line were chosen. Salmonella are intracellular bacteria, which infect and replicate within both epithelial cells and macrophages, causing acute gastroenteritis in humans. The HeLa-229 human cell line is epithelial in origin, so will constitute an acceptable host for Salmonella.

This experiment will test this hypothesis using a DESeq2 analysis, with a two-group experimental design.

2 RESULTS

2.1 Overview

An DESeq2 analysis was conducted on data comparing gene expression in HeLa-229 cells infected with Salmonella Typhimurium with a control group. The analysis was performed with the DESeq2 tool, using the R package of the same name.

2.1.1 Data Set

Title: Transcriptomic profiling of HeLa-229 cells infected with Salmonella Typhimurium
Accession Number: SRP034009 [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP034009]
BioProject: PRJNA231559 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA231559]
File URL [http://duffel.rail.bio/recount/v2/SRP034009/rse_gene.Rdata]

2.1.2 Experimental Design

The data set was a two-group experimental design with three replicates in each group. One group was the Salmonella infected cells, the second, the control group.

2.1.3 Analysis Process

The DESeq2 R package was used to perform the Differential Gene Expression Analysis (DGE) and the clusterProfiler R package for the Gene Set Enrichment Analysis (GSEA).

  • An initial analysis reviewed the counts, and their distribution to confirm quality and statistical criteria were satisfied.
  • A Principle Component Analysis was performed for quality assessment and exploratory analysis.
  • A Volcano Plot was generated, highlighting significant expression of genes.
  • A HeatMap was generated for the most over and under expressed genes.
  • A Gene Set Enrichment Analysis was performed to determine to most significantly expressed pathways.

2.2 Principal Component Analysis

A Principal Component Analysis (PCA) was performed on the data which showed that first 2 components explain TODO 79% of the variance, and the first four components approximately TODO 94%.

The component plots show strong differentiation on the PC1 axis for the control and condition groups, with tight clustering for the control replicates, and Salmonella replicates split into two clusters, with Salmonella Replicate 1 differing from the other 2 Salmonella replicates. The heatmap shows a different pattern of of expression for Salmonella replicate and the analysis discussed in that section.

The interactions between the PC2 and PC3 are less informative, with clustering of the control replicates only. The remaining principle components provided no useful insights.

2.3 DESeq2 Analysis

The DESeq2 analysis was generated and a MA plot generated, which showed decreasing variability with increased counts, and an approximately equal proportion of over and under expressed genes.

## 
  |                                                                                                          
  |                                                                                                    |   0%
  |                                                                                                          
  |==================================================                                                  |  50%
  |                                                                                                          
  |====================================================================================================| 100%
## 
## 
  |                                                                                                          
  |                                                                                                    |   0%
  |                                                                                                          
  |==================================================                                                  |  50%
  |                                                                                                          
  |====================================================================================================| 100%
## 
  |                                                                                                          
  |                                                                                                    |   0%
  |                                                                                                          
  |==================================================                                                  |  50%
  |                                                                                                          
  |====================================================================================================| 100%

The table below shows all the significantly differentially expressed genes.

2.4 Volcano Plot

The top most significantly expressed genes included CCL2, IL6, CXCL2, CTSL and all of these are involved in inflammation or immune response pathways. Since they are differentially expressed, the presence of Salmonella is a possible cause. The underlying presence of HPV-18 in the cell line may also explain some of this expression1, but this cannot be determined with the experiment design.

Overall the plot shows conclusively that significant over and under expression occurred, with somewhat more over-expression than under-expression.

2.5 HeatMap

A review of the most over-expressed and under-expressed genes was found to broadly support the hypothesis. However, the value of interpreting individual gene expression levels when investigating complex pathways such as immune or infection responses was limited.

2.5.1 Salmonella Replicate 1

This replicate was significantly different from the other Salmonella replicates from the PCA analysis. Compared to the other Salmonella replicates, there appears to be a group of genes under-expressed. This includes genes S100A9, FGB, S100A8, LBP and VWA1.

Using Enrichr 2, this group of genes relates to several pathways such as Endogenous Toll-Like Receptor signalling, involved in infection responses and IL-17 Signalling, involved in inflammation responses. These pathways were not detected in the GSEA analysis, however. Overall, the expression pattern is similar enough to the other replicates to warrant further investigation.

2.5.2 Salmonella Replicates 2 and 3

These two replicates were similar in their expression of various genes, with many relating to inflammation or infection responses.

2.6 Gene Set Enrichment Analysis

A Gene Set Enrichment Analysis (GSEA) was performed, in which some pathways 3 were related to infection and immunologic specific pathways, specifically:

  • GOBP_RESPONSE_TO_WOUNDING
  • GOBP_REGULATION_OF_RESPONSE_TO_WOUNDING
  • GOBP_INFLAMMATORY_RESPONSE
  • GOBP_REGULATION_OF_WOUND_HEALING

Some pathways were suggestive of tumor expression, such as GOBP_EPITHELIAL_CELL_PROLIFERATION. The presence of these may be confounding factors related to the HeLa cell line or HPV-18.

Overall, the most significant pathways had very close enrichment scores (within 10% of each other, for the top 100 pathways), with many pathways not obviously related to the experiment. Therefore, no obvious trend was revealed from the GSEA analysis, and support for the hypothesis was weak.

2.6.1 Top 5 Over and Under-Expressed Pathways

2.6.2 Top 5 GSEA Plots

The peak enrichment scores for the top 5 plots were very similar, although most of the pathways did not directly relate to the hypothesis being explored.

2.6.2.1 Plot 1

2.6.2.2 Plot 2

2.6.2.3 Plot 3

2.6.2.4 Plot 4

2.6.2.5 Plot 5

2.6.3 Significant Pathways

The significant pathways are shown in the table below. All have very close enrichment scores - from 1.88 for the first gene to 1.68 for the first 100 pathways. So the the most significant pathways are too similarly expressed to make meaningful distinctions between them.

3 DISCUSSION

The GSEA analysis showed several pathway expression relating to inflammation, immune and bacterial infection response, along with many other pathways with similar enrichment scores. Overall, there was very limited support for the hypothesis.

There was expression of some pathways potentially related to oncogenesis, which could possibly be related to the cell line. There is previous evidence to suggest highly aberrant expression of some pathways in this cell line 4. Also, the cell line contains Human Papilloma Virus 18 DNA, known to cause the majority of cervical neoplasms and also implicated in the cell line’s characteristic immortality. There is evidence of HPV-18 mRNA expression in HeLa cell lines5. These are all confounding factors which warrant further investigation.

The value of investigating individual gene expression with heatmaps or volcano plots for evaluating complex pathways appears limited, as gene expression and functional roles are heavily context dependent. The GSEA analysis was found to be slightly more effective in discovering the cellular responses to intracellular bacterial infection.

3.1 Future Lines of Enquiry

Given the nature of the cell-line used in this experiment, future experiments could be made with other, non-neoplastic epithelial cell lines, to try and eliminate the confounding problems of aberrant tumor cell expression and viral infection. Another dimension would be to use other intracellular bacteria to see how pathway expression differs across bacterial species. It is also unclear, at what stage of the infection the samples were taken, as this may affect gene expression, as well as the natural variation in infection life-cycles between individual cells.

From a methodological perspective, it is recommended to perform RNA-seq analyses with at least two methods, from toolsets such as DESeq2, EdgeR and Limma-Voom.

4 APPENDIX

4.1 Duplicate Reads in Data

A review of the DEseq data found a large number of read count duplicates (87.6% of counts has at least 1 duplicate), which gave warnings in the GSEA analysis. Duplicate reads are common in RNA-seq data sets, though the high degree of duplicate may be a cause for concern and may be caused by technical sequencing issues or a high abundance of a small number of genes.

After consideration, no action was taken to remove the duplicate values.

The table below shows duplicates of two or more counts in the raw assay data.

# assay count duplicates
assay_df <- data.frame(assay(rse_gene))
assay_df<- assay_df %>% arrange(desc(SRR1049363))
dupsCounts <- assay_df %>% group_by(SRR1049363) %>% filter(n() > 1)
head(dupsCounts, n = 12)
## # A tibble: 12 x 6
## # Groups:   SRR1049363 [6]
##    SRR1049363 SRR1049364 SRR1049365 SRR1049366 SRR1049367 SRR1049368
##         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1      54302      45609      40415      21281      23656      29219
##  2      54302      61654      61103      10225       7706       9846
##  3      53022      55343      51375      34511      34428      41473
##  4      53022      58643      56965      24197      26196      31438
##  5      50472      58848      55352      55674      52378      64884
##  6      50472      52794      53197      30063      30721      38816
##  7      49502      48087      45419      50665      45347      54503
##  8      49502      55923      59935      19846      18815      21361
##  9      46223      49472      47237      13190      14856      17526
## 10      46223      54347      51152      29966      30935      38122
## 11      44507      55588      53645      48451      48295      56803
## 12      44507      52747      53455      16894      16916      20433
duplicated_n <- nrow(dupsCounts)
duplicated_percentage <- round(duplicated_n/nrow(assay_df) * 100, 1)
print(duplicated_percentage)
## [1] 87.6

4.2 Analysis of Distribution

DESeq2 assumes a negative normal distribution, so the data was checked to assess where this was a valid assumption. The dispersion plot shows that variances exceed the means, satisfying the requirement for accepting the negative binomial distribution assumption.

4.3 Software Version Details

The details below detail the relevant system version details used to produce this analysis.

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggpubr_0.4.0                clusterProfiler_4.0.0       msigdbr_7.4.1              
##  [4] pheatmap_1.0.12             EnhancedVolcano_1.10.0      here_1.0.1                 
##  [7] DT_0.18                     stringr_1.4.0               data.table_1.14.0          
## [10] tibble_3.1.2                dplyr_1.0.7                 BiocParallel_1.26.0        
## [13] gridExtra_2.3               plotly_4.9.4.1              ggrepel_0.9.1              
## [16] ggplot2_3.3.4               DESeq2_1.32.0               SummarizedExperiment_1.22.0
## [19] Biobase_2.52.0              MatrixGenerics_1.4.0        matrixStats_0.59.0         
## [22] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0         IRanges_2.26.0             
## [25] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.2.1        shadowtext_0.0.8       fastmatch_1.1-0       
##   [5] plyr_1.8.6             igraph_1.2.6           lazyeval_0.2.2         splines_4.1.0         
##   [9] crosstalk_1.1.1        digest_0.6.27          htmltools_0.5.1.1      GOSemSim_2.18.0       
##  [13] viridis_0.6.1          GO.db_3.13.0           fansi_0.5.0            magrittr_2.0.1        
##  [17] memoise_2.0.0          openxlsx_4.2.4         Biostrings_2.60.1      annotate_1.70.0       
##  [21] graphlayouts_0.7.1     extrafont_0.17         extrafontdb_1.0        enrichplot_1.12.1     
##  [25] colorspace_2.0-1       blob_1.2.1             haven_2.4.1            xfun_0.24             
##  [29] crayon_1.4.1           RCurl_1.98-1.3         jsonlite_1.7.2         scatterpie_0.1.6      
##  [33] genefilter_1.74.0      survival_3.2-11        ape_5.5                glue_1.4.2            
##  [37] polyclip_1.10-0        gtable_0.3.0           zlibbioc_1.38.0        XVector_0.32.0        
##  [41] DelayedArray_0.18.0    proj4_1.0-10.1         car_3.0-10             Rttf2pt1_1.3.8        
##  [45] maps_3.3.0             abind_1.4-5            scales_1.1.1           DOSE_3.18.1           
##  [49] DBI_1.1.1              rstatix_0.7.0          Rcpp_1.0.6             viridisLite_0.4.0     
##  [53] xtable_1.8-4           tidytree_0.3.4         foreign_0.8-81         bit_4.0.4             
##  [57] htmlwidgets_1.5.3      httr_1.4.2             fgsea_1.18.0           RColorBrewer_1.1-2    
##  [61] ellipsis_0.3.2         pkgconfig_2.0.3        XML_3.99-0.6           farver_2.1.0          
##  [65] locfit_1.5-9.4         utf8_1.2.1             tidyselect_1.1.1       labeling_0.4.2        
##  [69] rlang_0.4.11           reshape2_1.4.4         AnnotationDbi_1.54.1   cellranger_1.1.0      
##  [73] munsell_0.5.0          tools_4.1.0            cachem_1.0.5           cli_2.5.0             
##  [77] downloader_0.4         generics_0.1.0         RSQLite_2.2.7          broom_0.7.7           
##  [81] evaluate_0.14          fastmap_1.1.0          yaml_2.2.1             ggtree_3.0.2          
##  [85] babelgene_21.4         knitr_1.33             bit64_4.0.5            tidygraph_1.2.0       
##  [89] zip_2.2.0              purrr_0.3.4            KEGGREST_1.32.0        ggraph_2.0.5          
##  [93] nlme_3.1-152           ash_1.0-15             ggrastr_0.2.3          aplot_0.0.6           
##  [97] DO.db_2.9              rstudioapi_0.13        compiler_4.1.0         curl_4.3.2            
## [101] beeswarm_0.4.0         png_0.1-7              ggsignif_0.6.2         treeio_1.16.1         
## [105] tweenr_1.0.2           geneplotter_1.70.0     stringi_1.6.2          highr_0.9             
## [109] ggalt_0.4.0            forcats_0.5.1          lattice_0.20-44        Matrix_1.3-4          
## [113] vctrs_0.3.8            pillar_1.6.1           lifecycle_1.0.0        BiocManager_1.30.16   
## [117] cowplot_1.1.1          bitops_1.0-7           patchwork_1.1.1        qvalue_2.24.0         
## [121] R6_2.5.0               rio_0.5.27             KernSmooth_2.23-20     vipor_0.4.5           
## [125] MASS_7.3-54            assertthat_0.2.1       rprojroot_2.0.2        withr_2.4.2           
## [129] GenomeInfoDbData_1.2.6 hms_1.1.0              grid_4.1.0             tidyr_1.1.3           
## [133] rmarkdown_2.9          rvcheck_0.1.8          carData_3.0-4          ggforce_0.3.3         
## [137] ggbeeswarm_0.6.0

4.4 References